Document-Centric Analysis

In this project, a platform for the automated analysis of PDF documents was developed. The platform offers a variety of processing and analysis options that allow large volumes of documents to be analyzed and classified efficiently. The development focused on modern machine learning methods, natural language processing (NLP) and the creation of a knowledge graph.

Processing and analysis options

1. Splitting of merged PDF files

A central feature of the platform is the ability to split merged PDF files into separate documents. Several classification algorithms were developed and evaluated for this purpose; they recognize the start and end pages of the individual documents. Classifying each page in this way determines the exact point of transition from one document to the next.
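The step from page labels to separate documents can be sketched as follows. This is a minimal illustration, not the platform's code: it assumes a classifier has already produced one boolean per page ("does a new document start here?"), and the function name `split_by_start_pages` is hypothetical.

```python
from typing import List, Tuple


def split_by_start_pages(is_start: List[bool]) -> List[Tuple[int, int]]:
    """Turn per-page start flags into (first_page, last_page) ranges.

    Pages are 0-based and ranges are inclusive. The first page is always
    treated as the start of a document, whatever its flag says.
    """
    ranges: List[Tuple[int, int]] = []
    start = 0
    for page, flag in enumerate(is_start):
        if flag and page > 0:
            # A new document begins here, so the previous one ends one page earlier.
            ranges.append((start, page - 1))
            start = page
    if is_start:
        # Close the last open document at the final page.
        ranges.append((start, len(is_start) - 1))
    return ranges


if __name__ == "__main__":
    # Hypothetical classifier output for a six-page merged PDF.
    print(split_by_start_pages([True, False, True, False, False, True]))
```

Each resulting range could then be written out as its own PDF with whatever PDF library is in use.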

2. Topic classification

A topic model based on Latent Dirichlet Allocation (LDA) was used for the thematic classification of the documents. LDA is a probabilistic model that treats each document as a mixture of topics and each topic as a distribution over words; the topics of a document are inferred from the words it contains. This enabled an efficient thematic classification of the documents.
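To make the idea concrete, the following toy example infers per-document topic mixtures with collapsed Gibbs sampling, one common way to fit LDA. It is only a sketch: the write-up does not say which LDA implementation the platform used, a real system would normally rely on an established library, and the function name, the hyperparameters `alpha` and `beta`, and the sample documents are all assumptions.

```python
import random
from typing import List


def lda_gibbs(docs: List[List[str]], n_topics: int, n_iter: int = 200,
              alpha: float = 0.1, beta: float = 0.01, seed: int = 0) -> List[List[float]]:
    """Return one topic distribution per document (toy collapsed Gibbs sampler)."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]               # tokens per (document, topic)
    topic_word = [[0] * n_vocab for _ in range(n_topics)]    # tokens per (topic, word)
    topic_total = [0] * n_topics                             # tokens per topic
    assignments: List[List[int]] = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for word in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed topic proportions per document; each row sums to 1.
    return [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]


if __name__ == "__main__":
    sample_docs = [
        "invoice payment amount due invoice".split(),
        "contract party clause signature contract".split(),
        "payment invoice amount tax".split(),
    ]
    for row in lda_gibbs(sample_docs, n_topics=2):
        print([round(p, 2) for p in row])
```

A document can then be classified by its highest-probability topic, or the full mixture can be kept when a document covers several themes.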

3. Sentiment analysis

Another feature of the platform is sentiment analysis, which determines the emotional tendency of the texts (positive, negative, neutral). A specialized BERT model obtained from Hugging Face was used for this purpose. It analyzes the text at sentence level, and the sentence-level results are used to evaluate the emotional orientation of the text as a whole.
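How sentence-level labels might be combined into one label per document is sketched below. The write-up only states that document sentiment is based on the sentiment of individual sentences; the majority vote, the tie-breaking rule, and the function name `document_sentiment` are assumptions made for illustration. The labels themselves would come from the BERT model.

```python
from collections import Counter
from typing import List


def document_sentiment(sentence_labels: List[str]) -> str:
    """Aggregate per-sentence labels into one document label by majority vote.

    Ties, and documents without any sentences, are reported as "neutral".
    """
    if not sentence_labels:
        return "neutral"
    counts = Counter(sentence_labels).most_common()
    best_count = counts[0][1]
    winners = [label for label, count in counts if count == best_count]
    return winners[0] if len(winners) == 1 else "neutral"


if __name__ == "__main__":
    # Hypothetical model output for a three-sentence document.
    print(document_sentiment(["positive", "positive", "negative"]))
```

Other aggregation schemes, for example averaging the model's confidence scores, would fit the same interface.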

4. Named entity recognition

The Python library spaCy, together with one of its pretrained models, was used for Named Entity Recognition (NER). It recognizes and extracts named entities such as persons, places and organizations from the documents. NER made it possible to identify important entities within the documents, which was particularly helpful for further processing and for building the knowledge graph.
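A minimal sketch of this step with spaCy could look as follows. The model name `en_core_web_sm` is an assumption (the write-up does not name the model or the document language), and entity labels such as `PERSON` or `GPE` depend on the model chosen. The helper names are hypothetical; spaCy is imported inside the function so that the grouping helper can be used without it.

```python
from typing import Dict, Iterable, List, Tuple


def extract_entities(text: str, model_name: str = "en_core_web_sm") -> List[Tuple[str, str]]:
    """Run a spaCy pipeline and return (entity text, entity label) pairs."""
    import spacy  # imported lazily; requires spaCy and the named model to be installed

    nlp = spacy.load(model_name)  # assumed model, see the note above
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def group_entities(entities: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group entity texts by label, dropping duplicates but keeping first-seen order."""
    grouped: Dict[str, List[str]] = {}
    for text, label in entities:
        names = grouped.setdefault(label, [])
        if text not in names:
            names.append(text)
    return grouped


if __name__ == "__main__":
    # Hand-written pairs standing in for extract_entities() output.
    pairs = [("Berlin", "GPE"), ("Anna Schmidt", "PERSON"), ("Berlin", "GPE")]
    print(group_entities(pairs))
```

In a long-running service the model would be loaded once at start-up rather than on every call.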

5. Knowledge Graph

Based on the recognized entities and the underlying grammatical structures, a knowledge graph was created. For this purpose, grammatical structures and semantic relationships (subjects, objects and the relations between them) were extracted from the documents using spaCy. The knowledge graph made it possible to represent complex relationships between entities and to extract knowledge from the analyzed documents.
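Once subject, relation and object have been extracted, assembling the graph itself is straightforward. The sketch below stores the triples in a plain adjacency structure; the storage format, the function names and the sample triples are assumptions, and the write-up does not describe how the platform actually stored its graph.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]             # (subject, relation, object)
Graph = Dict[str, List[Tuple[str, str]]]  # subject -> [(relation, object), ...]


def build_graph(triples: Iterable[Triple]) -> Graph:
    """Collect triples into an adjacency structure, ignoring exact duplicates."""
    graph: Graph = {}
    for subject, relation, obj in triples:
        edges = graph.setdefault(subject, [])
        if (relation, obj) not in edges:
            edges.append((relation, obj))
    return graph


def objects_of(graph: Graph, subject: str, relation: Optional[str] = None) -> List[str]:
    """Return the objects linked to a subject, optionally filtered by relation."""
    return [obj for rel, obj in graph.get(subject, [])
            if relation is None or rel == relation]


if __name__ == "__main__":
    # Hypothetical triples as they might come out of the extraction step.
    triples = [
        ("Anna Schmidt", "works_for", "ACME"),
        ("ACME", "located_in", "Berlin"),
        ("Anna Schmidt", "works_for", "ACME"),
    ]
    graph = build_graph(triples)
    print(objects_of(graph, "Anna Schmidt", "works_for"))
```

A graph database or a graph library would replace the dictionary in a larger system, but the triple-based input stays the same.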

Architecture of the platform

The platform is based on a microservices architecture in which each analysis runs as a separate service. Kafka was used for communication between the microservices, so that analysis requests and results could be distributed and processed efficiently.
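Because some analyses depend on the output of others, the services have to be triggered in a valid order. The sketch below shows only that ordering logic, using Python's standard `graphlib` module (Python 3.9 or later); it contains no Kafka client calls. The service names and the dependencies between them are assumptions chosen to match the analyses described above.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical services mapped to the services whose results they need first.
DEPENDENCIES: Dict[str, Set[str]] = {
    "pdf_split": set(),
    "topic": {"pdf_split"},
    "sentiment": {"pdf_split"},
    "ner": {"pdf_split"},
    "knowledge_graph": {"ner"},
}


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every service runs after its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for service in execution_order(DEPENDENCIES):
        # An orchestrator would publish a request message for this service here
        # and wait for its result message before moving on.
        print(service)
```

Services without mutual dependencies, such as topic classification and sentiment analysis in this example, could also be triggered in parallel.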

Web frontend

A user-friendly web frontend was developed with Vue.js and Vuetify. The frontend allowed users to select different analysis options and display the analysis results in a clear form. These included thematic classifications, sentiment analyses and the Knowledge Graph.

Deployment

The entire platform was deployed using Docker containers to run the individual microservices in an isolated and scalable environment. The Docker containers were hosted on a virtual machine (VM) to ensure a flexible and portable infrastructure.

Conclusion

The development of this platform has provided a powerful and scalable solution for analyzing PDF documents. The combination of topic classification, sentiment analysis, named entity recognition and the creation of a knowledge graph made it possible to gain in-depth insights from large volumes of documents. The microservices architecture, coupled with the use of Kafka and Docker, ensured efficient workload distribution and a robust infrastructure.

Activities

Implementation of a Python script to orchestrate Kafka messages to various application services to ensure that they are executed in the correct order
Implementation of a web front end including data visualizations of individual analysis components with Vue.js
Implementation and evaluation of AI procedures for splitting PDF documents into individual sub-documents
Integration of an AI model for sentiment analysis of documents, based on the sentiment of individual sentences
Extraction of subjects, objects and relations from documents to generate a knowledge graph
Containerization of the application with Docker and deployment of the platform on a server